Remove openblas set_num_threads in julia __init__ #42442
Conversation
This needs to be followed up with increasing the max number of threads as discussed in this PR, so that users can start getting good linear algebra performance out of the box.
Getting 2x
Are the Buildkite failures related to the contents of this PR?
👍 Do we want to do anything if the number of Julia threads is specified?
I think we should leave it separate unless and until we can integrate OpenBLAS threads with Julia threads.
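For what it's worth, a minimal user-side sketch (not something this PR does): anyone who wants the BLAS thread count to follow the Julia thread count can already set it explicitly, e.g. in `startup.jl`:

```julia
# Match the OpenBLAS thread count to the number of Julia threads.
# This is a user-side workaround, not behaviour added by this PR.
using LinearAlgebra
BLAS.set_num_threads(Threads.nthreads())
```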
What's the issue with the single thread failure?
I don't know, but I've retried it three times now, and the same job fails every time.
It looks like the failure is happening in the Distributed test set. @vchuravy could you take a look?
Yeah @ViralBShah, that test in Distributed.jl needs to be adjusted, since it is explicitly testing that invariant.
What would be the most correct way to fix the test?
Note the impact on startup times discussed in JuliaPackaging/Yggdrasil#3667 (comment). This will remove the cap of 8 threads, and thus add something like 50 ms of startup time when using 32 threads (if the hardware actually has that many cores).
For Buildkite CI, we probably still want to limit the OpenBLAS thread count, so that each Buildkite job doesn't start e.g. 64 OpenBLAS threads.
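As a hedged sketch of how a CI job could cap the count itself (the cap of 4 below is an arbitrary example, and `OPENBLAS_NUM_THREADS` can also be set in the environment before Julia starts):

```julia
# Cap OpenBLAS threads explicitly instead of relying on the default;
# the value 4 is only an illustrative choice for CI machines.
using LinearAlgebra
BLAS.set_num_threads(min(Sys.CPU_THREADS, 4))
```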
Personally, I think adding 25% to startup time for everyone with a high-core-count system is a bit much. It feels more reasonable to opt into that if you think you are going to use it. A lot of work has gone into pushing that startup time down.
I should also point out that this 50ms is not purely a startup time hit; it includes the first call into BLAS. If I do a pure startup time test, there is no perceptible difference: I get 0.16s with 8 threads and 0.17s with 32 threads on a fairly old machine. In many runs I get the same time for 8 and 32 threads, which suggests a difference on the order of 5ms. I feel this is reasonable: there is a bit more of an impact when you first enter BLAS, but pure startup remains unaffected.
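A rough way to reproduce this kind of measurement (the matrix size of 1000 is an illustrative choice, not a value from the discussion):

```julia
# Compare bare startup against startup plus a first BLAS call.
using LinearAlgebra

t_startup = @elapsed run(`$(Base.julia_cmd()) -e ""`)
t_blas    = @elapsed run(`$(Base.julia_cmd()) -e "using LinearAlgebra; LinearAlgebra.peakflops(1000)"`)
println("bare startup: $(round(t_startup; digits=3))s, with first BLAS call: $(round(t_blas; digits=3))s")
```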
I don't think this should necessarily be backported to 1.6. Changing the default behaviour like that seems okay for a minor release, but not in a patch release. |
I don't think it should go in 1.7 either, at least not in 1.7.0. The impact on loading times is unclear, and since this has not been tested on master at all, it introduces some uncertainty that we don't want for a heavily delayed release.
Could OpenBLAS be set up to initialize lazily, i.e. it waits to spawn threads until you call an OpenBLAS function for the first time? That way you wouldn't pay this startup cost for code that is not using linear algebra. |
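Purely as a hypothetical sketch of what lazy setup could look like on the Julia side (the `ensure_blas_threads!` name is invented, not an existing OpenBLAS_jll API):

```julia
using LinearAlgebra

const _blas_threads_set = Ref(false)

# Only pay the thread-spawning cost once linear algebra is actually used.
function ensure_blas_threads!()
    if !_blas_threads_set[]
        BLAS.set_num_threads(Sys.CPU_THREADS)
        _blas_threads_set[] = true
    end
    return nothing
end
```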
* Remove openblas_set_num_threads in julia __init__
* Remove test no longer needed.
We used to set the number of threads to a conservative number, since OpenBLAS thread detection used to be less reliable than it is today. Hopefully this leads to better out-of-the-box performance, since it is quite common for people to have more than 8 cores!
cc @chriselrod @staticfloat